Rodriguez-Fontenia et al. BMC Genomics 2014, 15:408 
http://www.biomedcentral.com/1471-2164/15/408 



(bmc 



Genomics 



METHODOLOGY ARTICLE Open Access 



Genetic distance as an alternative to physical 
distance for definition of gene units in association 
studies 

Cristina Rodriguez-Fontenia, Manuel Calaza and Antonio Gonzalez 



Abstract 

Background: Some association studies, as the implemented in VEGAS, ALIGATOR, i-GSEA4GWAS, GSA-SNP and 
other software tools, use genes as the unit of analysis. These genes include the coding sequence plus flanking 
sequences. Polymorphisms in the flanking sequences are of interest because they involve cis-regulatory elements or 
they inform on untyped genetic variants trough linkage disequilibrium. Gene extensions have customarily been 
defined as ±50 Kb. This approach is not fully satisfactory because genetic relationships between neighbouring 
sequences are a function of genetic distances, which are only poorly replaced by physical distances. 

Results: Standardized recombination rates (SRR) from the deCODE recombination map were used as units of 
genetic distances. We searched for a SRR producing flanking sequences near the ± 50 Kb offset that has been 
common in previous studies. A SRR > 2 was selected because it led to gene extensions with median length =45.3 
Kb and the simplicity of an integer value. As expected, boundaries of the genes defined with the ± 50 Kb and with 
the SRR >2 rules were rarely concordant. The impact of these differences was illustrated with the interpretation of 
top association signals from two large studies including many hits and their detailed analysis based in different 
criteria. The definition based in genetic distance was more concordant with the results of these studies than the 
based in physical distance. In the analysis of 18 top disease associated loci form the first study, the SRR >2 genes 
led to a fully concordant interpretation in 17 loci; the + 50 Kb genes only in 6. Interpretation of the 43 putative 
functional genes of the second study based in the SRR >2 definition only missed 4 of the genes, whereas the 
based in the ±50 Kb definition missed 10 genes. 

Conclusions: A gene definition based on genetic distance led to results more concordant with expert detailed 
analyses than the commonly used based in physical distance. The genome coordinates for each gene are provided 
to maintain a simple use of the new definitions. 



Background 

Genes are the unit of analysis or interpretation of 
multiple genetic association studies. However, multiple 
operational definitions of genes coexist in current use. 
Some are restricted to the coding sequence but, most 
often, they are extended to include flanking sequences 
because they contain polymorphisms that are informative 
of variation in the coding sequence trough linkage dis- 
equilibrium (LD) or polymorphisms that are themselves 
functional by involving regulatory sequences. Here, we 
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have addressed the definition of these gene extensions for 
application in gene- or pathway-based association studies, 
gene-based interaction analysis and interpretation of large 
numbers of top association signals for meta-analysis or for 
gene- and pathway- enrichment analysis. 

Gene- or pathway- based association studies [1-8] con- 
sider the genes, not the individual SNPs, as the units of 
analysis. Association statistics for the genes are obtained 
by combining the statistics corresponding to the SNPs 
mapping to each of them. In this way, it becomes pos- 
sible to identify genes with multiple independent SNPs 
contributing to the trait but lacking significant associ- 
ation on their own. The same considerations apply to 
pathway- or gene-set analyses, where the association 
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signals from the genes in a pathway are combined. A 
similar situation appears in interaction analyses where 
the objective is to identify pairs of genes contributing to 
a trait in a way that deviates from the simple addition of 
their independent effects [9,10]. This type of analysis can 
be done at the individual SNP level but this is very 
sensitive to small variations in the study, and analysis at 
the gene or pathway level has been advocated as more 
reproducible [9-11]. In addition, extended gene defini- 
tions can be useful in analysis that by considering many 
top association signals find it impractical a detailed 
analysis of each of them. For example, when it is neces- 
sary to decide if associations from a large number of 
studies are coincident or not in the same gene [12], or 
when interpreting multiple association signals [13,14]. 

In all these situations, genes have been operationally 
defined as the coding sequence plus a fixed physical dis- 
tance in each direction. Length of the extensions has 
been from 0 to 500 [5,7] Kb, but most often of 20 
[8,13,14] or 50 [1-4,9] Kb. This is a practical solution 
that is used because of its simplicity, but this definition 
is subjective and not fit for many genes. Here, we 
propose a definition of genes that is equally easy to apply 
and has the advantage of including genetic distance 
in place of physical distance. Genetic distance is the 
relevant one because it determines LD between poly- 
morphisms [15-18] and, therefore, the information that 
SNPs in the extensions provide about un-typed variation in 
the coding or regulatory sequences. Genetic and physical 
distances are not interchangeable because the corres- 
pondence between the two is very variable along the 
genome [15-18]. We took genetic distances as standardized 



recombination ratios (SRR) from the deCODE recom- 
bination map [16], which is the most accurate available. 
The new extended gene definitions were compared with 
definitions based on physical distances to illustrate their 
advantages. They are made available in a text file with 
genome coordinates to facilitate their use. 

Results 

Setting a SRR threshold 

It is well known that the recombination rate is very 
irregular along the human genome [15-18]. This irre- 
gularity leads to a skewed distribution of SRR along the 
genome (Figure 1) [16] including a large fraction of bins, 
42.6%, with no recombination (SRR = 0) and 78.4% of the 
bins with less than the average (SRR < 1). Therefore, most 
of the recombination takes place in the remaining 21.6% 
bins. Analysis of the SRR distribution showed that exten- 
sions of genes based on an SRR > 2 have a median physical 
length of 45.3 Kb (IQR = 22.9-90.2 Kb). This median 
length is similar to the most common physical distance 
extension used until now, which is of 50 Kb. The SRR > 2 
is only found in a minor fraction of bins, 12.9%. The 
remaining 87.1% of the 10 Kb bins showed lower SRR. No 
detailed optimization of the SRR was attempted preferring 
to keep the simplicity of an integer value. 

Comparison of genetic and physical distance based gene 
definitions 

Concordance between the median length of the extensions 
based on SRR > 2 and the ± 50 Kb rule made possible a 
direct comparison. However, the new definitions obtained 
here account for recombination and are variable (Figure 2), 
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Figure 1 Distribution of the standardized recombination rate (SRR) In the human genome. Number of 10 Kb bins from the deCODE 
recombination map [15] within each interval of SRR values. 
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Figure 2 Length distribution of the 35 044 gene extensions according to the SRR>2 rule. The 5' and 3' extensions for eacln gene liave 
been separately considered. All followed the SRR > 2 rule except for 2669 of genes near telomeres and centromeres, where information is 
incomplete and that were replaced by the median length of extensions in their chromosomes; most of them in the 40-50 Kb range. 



not uniform. They go from less than 10 Kb (8.8% of the 
extensions) to more than 500 Kb (1.2% of the extensions). 
The distribution of extension lengths implies that most 
gene boundaries are discordant between the two defini- 
tions. In fact, only 21.3% of the extensions obtained with 
one definition are within ± 10 Kb of the obtained with the 
other, and even less frequently (6.1%) when the two exten- 
sions of a gene are considered simultaneously. 

We have used two large GWAS with multiple associ- 
ated loci to illustrate differences between the two gene 
definitions. However, these analyses should not be con- 
fused with an attempt to replace detailed analysis of 
GWAS results. First, we used the interpretation of 18 
top association signals from the 2007 WTCCC GWAS 
[19]. The authors of this study gave lists of relevant 
genes for each associated locus based on analysis of the 
associated SNPs and LD around the top signal. These 
lists include from 0 to 23 genes. The SRR > 2 definition 
led to Usts that were more concordant with the WTCCC 
GWAS than the obtained with the ± 50 Kb definition 
(Table 1 and Additional file 1: Table SI [20]). All the 
genes selected by the WTCCC authors were also included 
when applying the two definitions, but in some loci the 
gene definitions led to consider some extra genes. 
Specifically, the SRR > 2 definition included additional 
genes in one locus, whereas the ± 50 Kb definition in- 
cluded additional genes in 12 of the 18 loci (P = 0.00015 
for the comparison of fully concordant loci). In more 
detail, six loci included an extra gene according to the ± 
50 Kb rule (an example shown in Figure 3A); four loci 
included two extra genes with the ± 50 Kb definition 
(two of these loci shown in Figure 3B and C); an additional 
locus included 3 extra genes in the list obtained with the ± 
50 Kb definition (Table 1). The remaining locus was the 



Table 1 Number of genes in association regions of the 
WTCCC GWAS top hits [19] 



Chromosome 


Disease"" 


WTCCC 


SRR>2' 


± 50 Kb 


5pl3 


CD 


0 






Wq24 


CD 


1 






Wq25 


T2D 


1 






9p21 


CAD 


2 






Wq21 


CD 


3 






16ql2 


CD 


4 






16ql2 


T2D 


1 




+ 1 


5q33 


CD 


2 




+ 1 


1p13'' 


RA 


7 




+ 1 


1p13'' 


T1D 


7 




+ 1 


16pl3 


T1D 


8 




+ 1 


16pl2 


BD 


9 




+ 1 


1p3l 


CD 


1 




+ 2 


2q37 


CD 


1 




+ 2 


18pll 


CD 


1 




+ 2 


12q24 


TIP 


15 




+ 2 


12ql3 


TIP 


26 




+ 3 


3p21 


CD 


18 


+ 7 


+ 9 




Total 


107 


+ 7 


+ 26 



^These two loci overlap. 

'^CD^ Crohn's disease, T2D = Type 2 diabetes, C/AD = Coronary artery disease, 
/?/\ = Rheumatoid arthrtitis, 170 = Type 1 diabetes, 6D = Bipolar disorder. 
SRR > 2 for the gene definition extended to reach a cumulative SRR > 2 in 
each direction; and ± 50 Kb for gene definition extended to this length in 
each direction. Only changes in the number of genes, not in their identity, 
were observed between the three lists:- no differences with the genes 
highlighted by the WTCCC authors; + number of additional genes beyond the 
highlighted by the WTCCC authors. A full list of genes in each loci is available 
as Additional file 1 : Table SI . 
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Figure 3 Relevant genes in loci from the WTCCC GWAS depending on gene definitions. Image modified form Figure five of the WTCCC 
2007 GWAS paper with permision [19]. Horizontal lines corresponding to the genes overlaping with the region of association have been added in 
the middle panel of A) B) and C), in red for the ± 50 Kb rule and in green for the SRR>2 definition. No lines were added for panel D). Association 
region in each locus is limited by vertical dotted lines. The upper panel represents the SNPs (black dots, genotyped; grey dots, imputed) in function of 
position (X axis) and association (Y axis = -log,o(P)). Middle panel, centimorgans per Mb estimated from Phase II HapMap. The purple line shows the 
cumulative genetic distance (in cM) from the hit SNP. Lower panel, known genes in orange, top track shows plus-strand genes and the middle track 
shows minus-strand genes in condensed format. Below these tracks, sequence conservation in 1 7 vertebrates. Information in middle and lower panels 
is from the UCSC Genome Browser. Positions are in NCBI build-35 coordinates. Known genes in the hit region according the WTCCC paper are listed in 
the upper right part 



unique in which the three lists were discordant. This locus 
is particularly difficult because it shows a very low recom- 
bination rate and, therefore, a very wide region of asso- 
ciation with ill-defined limits (Figure 3D). In addition, it 
shows a high density of genes implying large differences 
when applying alternative criteria. Overall, there were 107 
genes in the 18 association regions according with de- 
tailed analysis done by the WTCCC authors. The defin- 
ition based on genetic distances led to fully concordant 
results except for the difficult locus, where no criterion 
can be considered certain (Figure 3D). In contrast, the 
definition based on ± 50 Kb included 26 additional 
genes (P = 0.00025 for the comparison of the number of 
extra genes). Nine of these extra genes were from the 
difficult locus in chromosome 3, but there were 17 extra 
genes in other loci. This example illustrates the very 



good concordance between post-hoc detailed analysis of 
each locus done by the WTCCC authors and the simple 
overlap with gene definitions based on genetic distances. 
It also illustrates the differences between this definition 
and the based on a fixed physical distance. 

The second study used to illustrate differences between 
the gene definitions is a large GWAS that included a 
selection of putative functional candidate genes for many 
of the associated loci [21]. The authors of this study used 
two criteria to identify these genes. The two were based in 
SNPs that are in high LD (r^ > 0.8) with the top associated 
SNP and with predictable functional relevance because 
they disrupt the protein sequence, nsSNPs, or the expres- 
sion of a nearby gene, c«-eQTLs. The search extended to 
the more than 3 000 genes mapping 1 Mb around the 75 
top associated signals. It led to 43 functional candidates 
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Table 2 Functional candidate genes that are missed depending on the gene definition 



Chromosome^ 



Phenotype 



van der Harst et al. 



SRR>2'^ 



: 50 Kb"^ 



nsSNP 

lq23 
lq44 
6p21 
6p21 
lOqll 
nql3 
llqlS 
12q24 
12q24 
16q22 
19pl3 
22qll 
22ql2 



eQTL 



4q27 
6p23 
6p21 
8pll 
lOqll 
llplS 
llql3 
llqlS 
15q22 
15q25 
16q22 
Uqll 
17ql2 
17q25 
18q21 
19pl3 
22qll 
22ql3 



MCHC 
RBC 
MCH 
RBC 
MCV 
MCV 
HE 
HE 
MCV 
RBC 
MCV 
MCV 
MCH 
I nsSNP 

MCV 
MCH 

RBC 
MCHC 
MCV 

HE 
MCV 

HE 
MCV 
MCHC 

RBC 
MCH 

RBC 

HE 
MCH 
MCH 
MCV 
MCV 
I eQTL 
Total 



0R6YI ORWZl, SPTAl 

TRIM58 

HFE 

HiA-DQA 1 

MARCH8 

RPS6KB2 

ARHGEFU 

SH2B3 

ACADS 

CTRL, PSMBIO 
UBXDl, NUDT19 
YDJC 

FBX07, TMPRSS6 



CCNA2 
GMPR 

HiA-DQA l/HiA-DQA2 

C8orf40 

MARCH8 

AKIPl/aioifie, NRIP3 

RPS6KB2, PTPRCAP/COROBi 

ARHGEFU 

PTPiADl 

DNAJA4 

DUS2L 

ERALl, TRAF4 

CDK12 

PGSl 

a8orf25 

CALR FARSA 

UBE2L3 

ECGFl 

25 
43 



HFE 



UBXDl 
YDJC 



HiA-DQA2 



0R6Y1 



HFE 



CFRL, PSMBIC 
UBXDl 



HLA-DQA2 
NRIP3 

PTPiADl 

DUS2L 
ERALl 



-5 
-10 



^Loci in chromosome 17q21 were excluded from analysis because it contains a common inversion polymorphism of approximately 900 kb in populations with 
European ancestry that shows exceptional LD and inheritance [26]. 

'^Phenotypes were: MHCH = Mean cell haemoglobin concentration, RBC= Red blood cell count MCH = Mean cell haemoglobin, MCV= Mean cell volume 
and HB= Haemoglobin. 

'^Genes that did not overlap with the SRR > 2 or the ± 50 Kb definition are indicated: - no differences with the functional candidate genes highlighted by van der 
Harst ef al. [20]; genes that were highlighted by van der Harst ef al. [20] but whose definition did not overlap with the top associated SNP. 
Functional candidates were selected in van der Harst ef al. [20] because they contained nsSNP (upper rows) or were regulated by eQTL (lower rows) in LD with 
the top associated SNP. 
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(Table 2). These putative functional candidates were prior- 
itized relative to other genes in the loci and the aim of our 
current test has been to evaluate the capacity of the two 
gene definitions to highlight them. We found that the 
SRR > 2 definition performed better than the definition 
based on ± 50 Kb (Table 2). The difference was due to a 
larger number of genes failing to be highlighted by the 
latter approach. In more detail: the two methods missed 
the same candidate genes in 3 loci, the SRR > 2 definition 
missed an additional candidate, but the ± 50 I<b missed 
other 7 candidate genes (P = 0.028; Table 2). In this way, 
the SRR > 2 definition missed 9.3% of the putative func- 
tional candidates, whereas the ± 50 Kb definition missed 
23.3% of them. 

Discussion 

The gene definitions based on genetic distances lead to 
extensions with different physical lengths, meaning that 
most gene definitions are discordant from any other 
based on a fixed length as we have shown for the SRR > 
2 and ± 50 Kb. The advantages of the new definition 
stem from the fact that physical distance is an inaccurate 
substitute of genetic distance as a measure of the rela- 
tionships between polymorphisms in the population 
[15-18]. This has been illustrated by showing a better 
performance in the interpretation of top association 
signals of the simple overlap rule based on SRR > 2 
definitions than in the traditional ± 50 Kb In conse- 
quence, the new gene definitions will improve gene- 
and pathway-based analysis by definition. The benefits 
are obtained by shortening gene extensions where 
recombination is high and by lengthening them where 
recombination is low. 

These gene definitions are not intended for interpret- 
ation of top association results in individual GWAS. In 
every case that a more detailed analysis is worth the 
extra effort, it should be done. Our choice of two GWAS 
as examples for illustration of the differences between 
the two gene definition approaches was motivated by 
the quality and reproducibility of GWAS, not to predi- 
cate the use of gene definitions in this field. The two 
GWAS were selected because they were of high quality, 
have found a large number of loci, have done detailed 
analysis of all the loci and have provided a full descrip- 
tion of the genes selected for each of them. These are 
uncommon characteristics and we were fortunate that 
the two studies used different approaches for selecting 
the putative functional genes allowing a more thor- 
ough comparison of the ± 50 Kb and the SRR > 2 gene 
definitions. 

Other gene extensions based on genetic distances are 
possible. We chose the threshold of SRR > 2 because it 
produced extensions of similar median physical length as 
the most used in previous studies. It will be inappropriate 



and arbitrary to compare other SRR thresholds with the ± 
50 Kb gene definition because these definitions will have 
different coverage of the genome and such comparisons 
will mix two components: differences in coverage and 
lack of correspondence between genetic and physical 
distances. By using the SRR > 2 rule we assured an 
equivalent coverage of the genome and the comparison 
was focused in the lack of correspondence between the 
two distances. Later we found that it led to concordant 
results with detailed post-hoc analysis in 17 of 18 WTCCC 
GWAS associated loci and to inclusion of 90% of 43 func- 
tional candidates for red blood cell associated loci. 
Therefore, this definition seems convenient although we 
do not claim that more appropriate SRR thresholds 
could not be found for specific applications at around 
this value. 

Our approach of using genetic distances in place of 
physical ones is widely applicable; but the gene definitions 
we provide are only directly applicable to Europeans. 
Other maps and specific genetic parameters will be 
necessary to study other ethnic groups. A genetic map 
for individuals of African ancestry has already been 
reported [22]. In addition, we have taken genetic dis- 
tances from the deCODE recombination map [16], but 
genetic maps based on the decrease of LD can be taken as 
alternatives. Currently the best of these maps has been 
produced with HapMap samples [17,18]. Although the 
recombination map and the LD based maps have a high 
degree of correlation, there are differences between them 
and some gene definitions will be discordant. Both maps 
were obtained on the NCBI36 genome assembly that 
has been replaced by more recent ones. However, con- 
version of the maps to current assemblies will decrease 
their accuracy and we consider that is more accurate to 
convert SNP data to the NCBI36 assembly (with liftover 
in the UCSC browser at http://genome.ucsc.edu/cgi- 
bin/hgLiftOver, Remap in the NCBI site at http://www. 
ncbi.nlm.nih.gov/genome/tools/remap, or a similar tool), 
perform definition of gene units with the SRR > 2 rule 
and run the intended analyses with the gene- or 
pathway-units. 

We used the RefSeq catalogue of protein coding genes 
for our analysis [23]. At least other four human gene sets 
are widely available, all of them different in some aspect 
although sharing sources of information and methodo- 
logies and being more or less interconnected [24,25]. 
These sets are in continuous revision to incorporate 
findings of new experiments and technologies and none 
claims to be complete or definitive. The RefSeq set has 
been manually curated after incorporating information 
from multiple sources. It is considered conservative and 
trusted and other annotation projects use it as one of 
their inputs. Among those using RefSeq input, the GEN- 
CODE set combines manual and automatic annotation 
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and is more comprehensive by including the transcripts 
detected in the ENCODE project. However, the number 
of the RefSeq, UCSC and GENCODE protein coding 
genes is very similar. Differences between these sets are 
remarkable only in the number of transcripts per gene and 
in the number of exons for each gene [24]. For example, 
the number of transcript per gene is much larger in GEN- 
CODE than in RefSeq , with the UCSC set in between. 
These differences could slightly modify the boundaries of 
the gene units defined taking the RefSeq set as reference. 
Therefore, we consider that the provided gene definitions 
are generally valid and will perform well but would be not 
fully consistent with other gene sets. 

Conclusions 

A definition of genes based on the coding sequence plus 
extensions whose length is given by genetic distances 
was shown to lead to more accurate results in the two 
sets of top association signals analysed. Use of this defin- 
ition is made as simple as the commonly used until now 
by the Ust of gene coordinates on the physical map that 
is provided. 

Methods 

Baseline gene definitions 

The RefSeq collection (UCSC RefSeq hgl8) of 18 022 
human protein-coding genes in autosome chromosomes 
and their map positions from the NCBI36 genome 
assembly, which corresponds to the used by the de- 
CODE recombination map [16], were taken as the 
bases to which extensions were attached. 

Recombination information 

Relative frequency of recombination for each 10 Kb 
bin of the human genome was obtained as standard- 
ized recombination rates (SRR) from the sex-averaged 
deCODE recombination map [16]. SRR are the result 
of dividing the recombination rate corresponding to 
each bin by the overall recombination mean for the 
genome. 

Gene definitions 

A SRR given gene extensions of median length approaching 
the most commonly used 50 Kb boundary was searched. 
SRR inside the coding sequence were not considered. 
Two gene definitions were compared: one based on phys- 
ical distance and the other based on genetic distance. The 
first included RefSeq sequences + 50 Kb in each direction; 
the second, the RefSeq definition extended in 10 Kb bins 
until the cumulative SRR that gave a similar median 
length. In this way, two extensions were generated for 
each RefSeq gene per definition, one for each end, 5 ' and 
3'. Genes placed near the telomeres and the centromere 
of each chromosome were incompletely covered in the 



deCODE recombination map. For the 2 669 genes in 
this situation, length of extensions based on genetic 
distances was made equal to the median length of all 
other extensions for this specific chromosome. These 
manipulations were done with PERL and Unix scripts 
that combined data from RefSeq and the deCODE re- 
combination map, established the extension limits and 
generated the tabulated plain text file including one 
row per gene, and the columns: chromosome, "left" 
boundary, "right" boundary and gene name, which is 
available for download as supplementary material in 
Additional file 2 [20]. 

Assessment of the gene definitions 

Differences in length of the extensions between the ± 50 
Kb and the SRR based definitions were calculated for all 
genes. In addition, the two gene definitions were applied 
to two large GWAS identifying multiple loci and using 
different criteria to highlight associated genes. Firstly, we 
examined 18 loci associated in the WTCCC GWAS [19]. 
This study examined about 14 000 patients, 2 000 of each 
of 7 different major diseases, and 3 000 healthy controls. 
Findings included 18 independent associations with 
p<5xlO"^ analysed with detail in Figure five of the 
Nature paper. The authors defined regions of associ- 
ation (indicated by dotted vertical lines) extending 
until p values of the SNPs returned to background 
levels and, where possible, to recombination hotspots. 
We took these 18 loci and generated lists of genes 
overlapping with the association regions considering 
that any part of the genes according to the ± 50 Kb or 
the SRR definitions falling in between the dotted verti- 
cal lines was sufficient to count it. Secondly, we con- 
sidered the 43 functional candidate genes in the 75 loci 
identified in the GWAS from van der Harst et al. ana- 
lyzing 6 red blood cell quantitative phenotypes in 135 
000 subjects [21]. These functional candidates were 
selected with two criteria: presence of nsSNP or of 
eQTL with r^ > 0.8 with the top associated SNP. One of 
the loci was excluded from our analyses due to its 
exceptionally high LD and unique inheritance pattern 
due to a common inversion polymorphism under posi- 
tive selection in Europeans of about 900 Kb in chromo- 
some 17q21 [26]. We took the remaining 43 functional 
candidate genes and tested if they would be included 
among the highlighted genes using the ± 50 Kb or the 
SRR gene definitions around the top associated SNP. 
The study of van der Harst et al. included other cri- 
teria for prioritizing candidate genes based in analysis 
of the bibliography or in the physical proximity that we 
did not consider here. Comparisons of the number of 
genes failing with each of the two gene definitions were 
done with the one-tailed Fisher exact test applied to 
the 2x2 contingency tables. 
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Additional files 



Additional file 1: Table SI. List of genes included in each of the top 
associated loci from the WTCCC GWAS (Ref 19) by the authors of the 
study, with the SRR > 2 definition and with the ± 50 Kb rule. This file is 
available in the Dryad Digital Repository, doi:l 0.5061/dryad.p58hb, 
http://doi.org/10.5061/dryad.p58hb. 

Additional file 2: Coordinates with the boundaries for the SRR > 2 
gene definitions. Tabulated plain text file including one row per gene, 
and the columns: chromosome, "left" boundary, "right" boundary and 
gene name. This file is available in the Dryad Digital Repository, 
doi:10.5061/dryad.p58hb, http://doi.org/10.5061/dryad.p58hb. 
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